Visualizing Numerical Data

Screenshot taken from Coursera 7:14

Screenshot taken from Coursera 9:28

Screenshot taken from Coursera 10:06

Measures of Center

Screenshot taken from Coursera 4:19

Measures of Spread

variance

Screenshot taken from Coursera 2:22

standard deviation

Screenshot taken from Coursera 3:51

variability vs diversity

Screenshot taken from Coursera 4:27

Screenshot taken from Coursera 4:58

Answer

  • This time, the answer is actually Set 2. Remember: distributions in which more observations cluster around the center are less variable, while distributions in which more observations lie far from the center are more variable.
  • We can look at dot plots of these distributions to make that point a little clearer. In the first set, the average gas mileage is 30 miles per gallon and the values range from 10 to 50, but there is one observation at the mean and two others that are closer to the mean than to the endpoints. In the second set, the average gas mileage is 26 miles per gallon and the values also range from 10 to 50, but there are no observations at or near the mean. Therefore, the average deviation from the mean is higher for this set.
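The comparison can be checked numerically. The sketch below is only illustrative: the notes do not list the data values, so the two sets are assumed values chosen to match the description (means of 30 and 26, both ranging from 10 to 50).

```python
import statistics

# Assumed values: not given in the notes, picked to match the stated
# means (30 and 26) and the shared range of 10 to 50.
set_1 = [10, 20, 30, 40, 50]   # one value at the mean, two fairly close to it
set_2 = [10, 10, 10, 50, 50]   # no values at or near the mean

for name, values in (("Set 1", set_1), ("Set 2", set_2)):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    print(f"{name}: mean = {mean}, sd = {sd:.2f}")

sd_1 = statistics.stdev(set_1)
sd_2 = statistics.stdev(set_2)
```

With these assumed values the standard deviation is about 15.81 for Set 1 and about 21.91 for Set 2, so Set 2 is the more variable one even though Set 1 contains more distinct values.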

Screenshot taken from Coursera 5:37

interquartile range (IQR)

Screenshot taken from Coursera 6:56

Screenshot taken from Coursera 6:56

Robust Statistics

  • We define robust statistics as measures on which extreme observations have little effect.
  • Let's give a quick example. We start with a small data set of values between one and six, and the mean and the median for these data are both 3.5. What if we change one of the values in the data set to be much larger, say 1000? The mean increases greatly, but the median stays the same at 3.5. In other words, the median is robust to the extreme observation. This is because the mean depends on all observations in the data set (it is the arithmetic average, after all), whereas the median depends only on the midpoint of the distribution, and the values at the endpoints are irrelevant to its calculation. We have just established that the median is a more robust statistic of center than the mean.
  • Along the same lines, the IQR, which like the median is based on positions in the ordered data (the quartiles), is a more robust measure of spread than the standard deviation, which is calculated using the mean, and than the range, which relies solely on the most extreme observations.
  • Robust statistics are most useful for describing skewed distributions or those with extreme observations, while non-robust statistics like the mean and standard deviation are useful for describing symmetric distributions.
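The mean and median example above can be reproduced in a few lines. This is a minimal sketch that assumes the data set is the integers 1 through 6 and that the largest value is the one replaced by 1000; the notes do not say which value changes.

```python
import statistics

original = [1, 2, 3, 4, 5, 6]
# Assumption: the largest value (6) is the one replaced by 1000.
with_outlier = [1, 2, 3, 4, 5, 1000]

for label, data in (("original", original), ("with outlier", with_outlier)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label}: mean = {mean:.2f}, median = {median}")
```

The mean jumps from 3.5 to roughly 169.17, while the median stays at 3.5 in both cases.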

Screenshot taken from Coursera 1:25

Transforming data

Screenshot taken from Coursera 0:40

  • The most commonly used transformation is the natural log transformation, which is often applied when much of the data cluster near zero relative to the larger values in the dataset and all observations are positive. For example, we saw earlier that the distribution of income per person was heavily right skewed, but after applying a natural log transformation the data become much more symmetric. Transformed data of this kind are often much easier to model, because they are much less skewed and outliers are usually less extreme.
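To see the effect of the log transformation on skewness, here is a small sketch. The income values are simulated (drawn from a lognormal distribution with arbitrary parameters), not the course data, and the `skewness` helper is a hypothetical name for a simple moment-based skewness calculation.

```python
import math
import random
import statistics

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by sd cubed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

random.seed(0)
# Hypothetical stand-in for income per person: positive, right-skewed values.
income = [random.lognormvariate(8, 1) for _ in range(1000)]
log_income = [math.log(v) for v in income]  # natural log transformation

skew_raw = skewness(income)
skew_log = skewness(log_income)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness near zero indicates a roughly symmetric distribution, so the transformed values should score much closer to zero than the raw ones.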

Screenshot taken from Coursera 0:57

  • Transformations can also be applied to one or both variables in a scatter plot to make the relationship between the variables more linear. And hence, easier to model with simple methods.
  • For example, here we have a scatter plot of income per person versus life expectancy. The relationship is positive and curved. If we apply a log transformation to the response variable and then plot the relationship again, the relationship stays positive but becomes more linear, which makes it easier to model than the untransformed data.
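One rough way to check whether a transformation makes a relationship more linear is to compare Pearson's correlation coefficient before and after. The sketch below uses made-up data in which the response grows exponentially with the explanatory variable; it is not the income and life expectancy data, and `pearson_r` is a hypothetical helper written out for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical curved, positive relationship: y grows exponentially with x.
x = list(range(1, 21))
y = [math.exp(0.3 * v) for v in x]
log_y = [math.log(v) for v in y]  # log transformation of the response

r_raw = pearson_r(x, y)
r_log = pearson_r(x, log_y)
print(f"r before: {r_raw:.3f}, r after log: {r_log:.3f}")
```

Because the made-up response is exactly exponential, the log-transformed relationship is perfectly linear here; real data would only move closer to linear.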

Screenshot taken from Coursera 1:23

  • Transformations other than the logarithm can be useful too.
  • Let's take a look at a new dataset. Here, we have a scatter plot of weight versus city mileage for a random sample of cars. We can see that the two variables are inversely related, which is expected: bigger cars get fewer miles to the gallon. But the relationship is not linear. In addition to the log, we can also try a square root transformation, where we plot the square root of the weight versus miles per gallon, or an inverse transformation, where we divide one by the weight of the car. It is difficult to tell just by looking at these plots which transformation works better, or whether either transformation actually yields something better than the original data.
  • Later in the course, we'll get into a little more detail about how to make such a call. But for now, it's important to realize that transformations can be useful even though they complicate interpretation a bit. After all, the log of income or the square root of weight is not easy to interpret.
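As a rough illustration of comparing candidate transformations, the sketch below applies the log, square root, and inverse transformations to car weight and reports the correlation of each with mileage. The data are invented: mileage is constructed as inversely proportional to weight, so the inverse transformation wins by design. With real data the comparison is usually much less clear-cut, as noted above. `pearson_r` is a hypothetical helper, and correlation is only one possible yardstick.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data: weights in pounds, mileage inversely proportional to weight.
weight = [2000 + 250 * i for i in range(11)]   # 2000, 2250, ..., 4500
mpg = [60000 / w for w in weight]

transforms = {
    "raw": lambda w: w,
    "log": math.log,
    "sqrt": math.sqrt,
    "inverse": lambda w: 1 / w,
}

# Correlation between each transformed weight and mileage.
r_by_transform = {
    name: pearson_r([f(w) for w in weight], mpg)
    for name, f in transforms.items()
}
for name, r in r_by_transform.items():
    print(f"{name:>7}: r = {r:.4f}")
```

The sign of r flips for the inverse transformation because 1/weight increases as weight decreases; what matters for linearity is how close the absolute value is to 1.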

Screenshot taken from Coursera 2:04

Screenshot taken from Coursera 2:59